Identifying personal genomes by surname inference.
Abstract
Sharing sequencing data sets without identifiers has become a common practice in genomics. Here, we report that surnames can be recovered from personal genomes by profiling short tandem repeats on the Y chromosome (Y-STRs) and querying recreational genetic genealogy databases. We show that a combination of a surname with other types of metadata, such as age and state, can be used to triangulate the identity of the target. A key feature of this technique is that it entirely relies on free, publicly accessible Internet resources. We quantitatively analyze the probability of identification for U.S. males. We further demonstrate the feasibility of this technique by tracing back with high probability the identities of multiple participants in public sequencing projects.
Similar resources
Excavating past population structures by surname-based sampling: the genetic legacy of the Vikings in northwest England.
The genetic structures of past human populations are obscured by recent migrations and expansions and have been observed only indirectly by inference from modern samples. However, the unique link between a heritable cultural marker, the patrilineal surname, and a genetic marker, the Y chromosome, provides a means to target sets of modern individuals that might resemble populations at the time o...
Variation in Hispanic Self-Identification, Spanish Surname, and Geocoding: Implications for Ethnicity Data Collection
This study examines the variation in surname analysis and geocoding, and their association with self-identified Hispanics in an HMO. We collected ethnicity data from three studies, and employed Spanish surname software and census tract level geocoding to create proxies for Hispanic ethnicity. We computed sensitivity, specificity, and estimated multivariate logistic regression models to examine ...
Assessing record linkage between health care and Vital Statistics databases using deterministic methods
BACKGROUND We assessed the linkage and correct linkage rate using deterministic record linkage among three commonly used Canadian databases, namely, the population registry, hospital discharge data and Vital Statistics registry. METHODS Three combinations of four personal identifiers (surname, first name, sex and date of birth) were used to determine the optimal combination. The correct linka...
Evolutionary inference across eukaryotes identifies multiple pressures favoring mitochondrial gene retention
Since their endosymbiotic origin, mitochondria have lost most of their genes. Although many selective mechanisms underlying the evolution of mitochondrial genomes have been proposed, a data-driven exploration of these hypotheses is lacking, and a quantitatively supported consensus remains absent. We developed HyperTraPS, a methodology coupling stochastic modelling with Bayesian inference, to id...
Recruiting Hispanic women for a population-based study: validity of surname search and characteristics of nonparticipants.
Conducting research on the health of Hispanic populations in the United States entails challenges of identifying individuals who are Hispanic and obtaining good study participation. In this report, identification of Hispanics using a surname search and ethnicity information collected by cancer registries was validated, compared with self-report, for breast cancer cases and controls in Utah and ...
Journal title:
- Science
Volume 339, Issue 6117
Pages -
Publication date 2013